Introduction

This is a tutorial, as the final project, for CMSC320 - Introduction to Data Science. In this tutorial, I will examine the dataset of “Uber Cars Demand in NYC” from Kaggle.

Motivation

Newyork city (NYC) is considered as one of the crowdest cities nationwide. Therefore, the demand of rideshare is extremely high during the days. The city is so popular with the yellow cab. However, since Uber was introduced to the people, the demand of using Uber, as a rideshare service, has been increasing dramatically. In this project, I will examine whether there are factors might affect the Uber Cars Demand in NYC.

Resouces

The main dataset for this project if the file Uber_metadata.csv, which is a combination of:

1/ “Uber Pickups in New York City, from 01/01/2015 to 30/06/2015. (by kaggle.com)”, which contains million of records of Uber pickups in NYC in different districts.

3/ Records of weather from National Centers For Environment Information (NOAA)

4/ Holidays in NYC

There are in total 13 variables in this dataset:

1/ pickup_dt: time that uber pickup passenger

2/ district: the district in NYC of the pickup

3/ pickups: the total pickup at the pickup time

4/ wspd: speed of wind (m/h)

5/ vsby: visibility (miles to nearest 10)

6/ temp: temperature (F)

7/ dewp: dew point (F) => higher dew point means more moisture to the air

8/ slpe: sea level pressure => low pressure causes more clouds and precipipation, otherwise sunny and clear weather.

9/ snde: snow depth (inches)

10/ hday: holiday (Y/N)

11/ pcp01: rain in last 1 hour.

12/ pcp06: rain in last 6 hour.

13/ pcp24: rain in last 24 hour.

Data Preparation:

First I need to load the data from the csv file

# Load the file Uber_metadata.csv then format the column
uber_df <- read_csv("uber.csv", col_types = cols(
  district = col_factor(levels = c("Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island")), 
  hday = col_factor(levels = c("Y", "N")), 
  pcp01 = col_double(), 
  pcp06 = col_double(), 
  pcp24 = col_double(), 
  pickup_dt = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
  snde = col_number())) %>% data.frame()

uber_df <- uber_df %>% filter(district != "NA")
head(uber_df)
##             pickup_dt      district pickups wspd vsby temp dewp   slpe
## 1 2015-01-01 01:00:00         Bronx     152    5   10   30    7 1023.5
## 2 2015-01-01 01:00:00      Brooklyn    1519    5   10   30    7 1023.5
## 3 2015-01-01 01:00:00           EWR       0    5   10   30    7 1023.5
## 4 2015-01-01 01:00:00     Manhattan    5258    5   10   30    7 1023.5
## 5 2015-01-01 01:00:00        Queens     405    5   10   30    7 1023.5
## 6 2015-01-01 01:00:00 Staten Island       6    5   10   30    7 1023.5
##   pcp01 pcp06 pcp24 snde hday
## 1     0     0     0    0    Y
## 2     0     0     0    0    Y
## 3     0     0     0    0    Y
## 4     0     0     0    0    Y
## 5     0     0     0    0    Y
## 6     0     0     0    0    Y

Singel Variable

In this section, I am examining each variable, by plotting, to gain the general ideas about the dataset

Pickups

ggplot(uber_df, aes(pickups)) +
  geom_histogram() +
  scale_x_sqrt() +
  scale_y_sqrt() + 
  labs(x = "Pickups",
         y = "Total")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I do square root both x and y axis to have a better view of the left side of the plot. Here we notice that the plot is skew on the left side which is similar to a union of normal distribution. It may come from the different distribution of pickups throughout different districts.

ggplot(uber_df, aes(pickups)) +
  geom_histogram(aes(fill = district)) +
  scale_x_sqrt(breaks = c()) +
  scale_y_sqrt() +
  labs(x = "Pickups",
         y = "Total")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Distinguishing the districts by colors gives me a better observation such as, a lot of 0 pickups are from Bronx, EWR Staten Island, etc. which makes sense since it is abnormal to pickup an Uber inside EWR airport, where is mostly dominated by cabs. Manhanttan seems to have the highest demands of Uber by this plot as well.

ggplot(uber_df, aes(pickups)) +
  geom_histogram() +
  scale_x_sqrt() +
  facet_wrap(~ district, ncol = 2, scales = 'free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot, again, makes better view about the pickups distribution through the districts. Manhattan has the highest demands in this plot as what we had seen by the previous plot. Very few pickups from EWR, the airport, and Staten Island.

Weather

Since the weather may be the key factor that affect the pickup number, it is helpful to plot the data of weather variables here.

# Categorizing the uder dataframe
uber_cat <- uber_df %>% spread(district, pickups, fill = 0)

weather <- melt(uber_cat %>% select(wspd:snde)) #all the weather variables
## No id variables; using all as measure variables
ggplot(weather, aes(value)) +
  geom_histogram() +
  facet_wrap(~variable , scales = 'free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Wind speed

ggplot(uber_cat, aes_string("wspd")) + geom_histogram(binwidth = 2) + labs(x = "Miles/Hour (mph)",
         y = "Total")

As observe, regularly, the wind speed is around 5mph which is weak. Also, the maximum speed it can rarely get is around 22mph which is not really strong, so wind speed seems to be not a significant factor to the pickups.

Visibility

ggplot(uber_cat, aes_string("vsby")) + geom_histogram(binwidth = 0.1) + labs(x = "Visibility",
         y = "Total") +
  scale_y_log10(breaks = c(0, 10, 100, 1000)) +
  scale_x_continuous(breaks = seq(0, 10, 1))
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 27 rows containing missing values (geom_bar).

Sumarizing and finding how many hours that the visibility is less than 10.

summary(uber_cat$vsb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.10   10.00    8.82   10.00   10.00
uber_cat %>% filter(vsby < 10) %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1120

This may have an effect on our model since there were 1120 hours of unclear visiblity.

Temperature

ggplot(uber_cat, aes_string("temp")) + geom_histogram() + labs(x = "Visibility",
         y = "Total") + scale_x_continuous(breaks = seq(0,90,5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Summarizing the temparature variable

summary(uber_cat$temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   31.50   45.00   47.49   64.00   89.00

As seen, the temprature in range between 2 and 89 F degree. The distribution of temperature has two peaks (bi-modal) at 35 and 60 degree.

Dew point

ggplot(uber_cat, aes_string("dewp")) + geom_histogram() + labs(x = "Dew Point",
         y = "Total")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot is similar to the plot of temperature since dew point is correlated with temperature.

Sea Level Pressure

ggplot(uber_cat, aes_string("slpe")) + geom_histogram() + labs(x = "Sea Level Pressure (millibars)",
         y = "Total")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sea level pressure directly affects the weather condition in the negative way that low pressure means cloudy and high precipitation. Therefore, it may impact the pickups. As plotting, the sea level pressure has a normal distribution of mode around 1022 millibars.

Precipitation

prec <- uber_cat %>% select(starts_with('pcp')) %>% 
  gather('precipitation', 'inches', 1:3)

ggplot(prec, aes(inches)) +
  geom_histogram() +
  scale_x_log10() +
  facet_wrap(~precipitation, ncol = 1)
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10226 rows containing non-finite values (stat_bin).

There are three main measurements in this plot: rain for the last hour, last 6 hours and last 24 hours. Thest may have a great impact on the rides.

Snow depth

Another natural factor is snowing may affect the rides.

ggplot(uber_cat, aes_string("snde")) + geom_histogram() + labs(x = "Snow Depth (inches)",
         y = "Total") +
  scale_x_sqrt() +
  scale_y_sqrt()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is rarely snow during the time range of this dataset.

uber_cat %>% filter(snde > 0) %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1341

In deed, there were only 1341 hours of snow.

Summary

So far, most of the variables have normal distributions. Some has bimodal and geometric distribution. In overall, I believe district, time, holidays and precipation are the greatest factors which affect the Uber pickups.

Multi Variables

Pickup and Datetime

ggplot(uber_cat, aes(yday, pickups)) +
  geom_jitter(alpha = 0.1) +
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.25), 
            linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.5), 
            color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.75), 
            linetype = 2, color = 'red') +
  geom_smooth(method='gam') +
  labs(x='Days', y='Pickups') +
  scale_x_continuous(breaks = c('1 Jan.' = 0, '1 Feb.' = 31, '1 Mar.' = 59, 
                                '1 Apr.' = 90, '1 May' = 120, '1 Jun.' = 151, 
                                '30 Jun.' = 181))

Here we can notice there is a pattern in this plot. In overall, there are around 26 peaks which is also the number of weeks of the dataset time range. The number of pickups is increasing overtime.

Pickup Per Day

ggplot(uber_cat, aes(wday, pickups)) +
  geom_boxplot() + labs(x='Week Day', y='Pickups')

The pattern here is more obvious with demand of rides is increasing during the week, low on Monday, but then higher on weekend.

Pickup Per Hour

ggplot(uber_cat, aes(hour, pickups)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth() + labs(x='Hour of Day', y='Pickups')
## `geom_smooth()` using method = 'gam'

Pattern here is demands get low around 5am in the morning and then the demands get higher during the day, especially in the evening. Peaks at around 8am and evening since those are the time people commute to work back and forth.

hm <- lm(formula = pickups ~ poly(hour,7), data = uber_cat)
summary(hm)
## 
## Call:
## lm(formula = pickups ~ poly(hour, 7), data = uber_cat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4856.0  -660.3   -89.3   651.2  6009.1 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3283.33      17.27 190.130   <2e-16 ***
## poly(hour, 7)1  76657.05    1138.04  67.359   <2e-16 ***
## poly(hour, 7)2  26013.41    1138.04  22.858   <2e-16 ***
## poly(hour, 7)3 -32137.87    1138.04 -28.240   <2e-16 ***
## poly(hour, 7)4  11276.08    1138.04   9.908   <2e-16 ***
## poly(hour, 7)5 -26670.94    1138.04 -23.436   <2e-16 ***
## poly(hour, 7)6    557.23    1138.04   0.490    0.624    
## poly(hour, 7)7  18877.73    1138.04  16.588   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1138 on 4335 degrees of freedom
## Multiple R-squared:   0.61,  Adjusted R-squared:  0.6094 
## F-statistic: 968.6 on 7 and 4335 DF,  p-value: < 2.2e-16

Here we wee hour of the day strongly effects the rides since it can explore around 61% of the data.

Working and Nonworking Day

ggplot(uber_cat, aes(workday, pickups)) +
  geom_boxplot() + labs(x = 'Workday', y = 'Pickups')

We can see that there is a small difference of demand of rides between work days and holidays.

Pickup and Temperature

ggplot(uber_cat, aes(temp, pickups)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth() + labs(x = 'Temperature (F)', y = 'Pickups')
## `geom_smooth()` using method = 'gam'

The demand increases rapidly when temperature gets over 75 degrees. The relation is clearlier in the below plot.

uber_cat <- uber_cat %>% mutate(over_75 = ifelse(temp > 75, 'Y', 'N'))
uber_ndf <- uber_ndf %>% mutate(over_75 = ifelse(temp > 75, 'Y', 'N'))

ggplot(uber_cat, aes(over_75, pickups)) +
  geom_boxplot() + labs(x = 'Over 75 F Degrees', y = 'Pickups')

Temperature and Datetime

ggplot(uber_cat, aes(pickup_dt, temp)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  scale_y_continuous(breaks = seq(0,80,5)) +
  scale_x_datetime() + labs(x = 'Pickup Datetime', y = 'Temperature')
## `geom_smooth()` using method = 'gam'

Dew Point and Temperature

ggplot(uber_cat, aes(temp, dewp)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth(method = lm) + labs(x = 'Temperature', y = 'Dew Point')

These two variables correlate strongly, so one of the two is good enough for the model.

Pickup and Wind Speed

ggplot(uber_ndf, aes(wspd, pickups)) +
  geom_jitter(alpha = 0.05) + 
  geom_smooth() +
  scale_y_sqrt() +
  coord_cartesian(ylim = c(0, 2500)) + labs(x = 'Wind Speed', y = 'Pickups')
## `geom_smooth()` using method = 'gam'

Again, here the correlation is not much diffrent so I believe the wind speed has no effects on the rides. These plots below generate the same idea about the correlation between Pickup and other variables.

Pickup and Visibility

ggplot(uber_cat, aes(vsby, pickups)) +
  geom_jitter(alpha = 0.1) +
  geom_smooth() + labs(x = 'Visibility', y = 'Pickups')
## `geom_smooth()` using method = 'gam'

Pickup and Sea Level Pressure

ggplot(uber_cat, aes(slpe, pickups)) +
  geom_jitter(alpha = 0.1) +
  geom_smooth() + labs(x = 'Sea Level Pressure', y = 'Pickups')
## `geom_smooth()` using method = 'gam'

Pickup and Precipitation

ggplot(uber_cat, aes(pcp01, pickups)) +
  xlim(0,quantile(uber_cat$pcp01, 0.95)) +
  geom_jitter(alpha = 0.1) +
  geom_smooth() + labs(x = 'Precipitation Last 1 Hour', y = 'Pickups')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 214 rows containing non-finite values (stat_smooth).
## Warning: Removed 2220 rows containing missing values (geom_point).

ggplot(uber_cat, aes(pcp06, pickups)) +
  xlim(0,quantile(uber_cat$pcp06, 0.95)) +
  geom_jitter(alpha = 0.1) +
  geom_smooth() + labs(x = 'Precipitation Last 6 Hour', y = 'Pickups')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 218 rows containing non-finite values (stat_smooth).
## Warning: Removed 1981 rows containing missing values (geom_point).

ggplot(uber_cat, aes(pcp24, pickups)) +
  xlim(0,quantile(uber_cat$pcp24, 0.95)) +
  geom_jitter(alpha = 0.1) +
  geom_smooth() + labs(x = 'Precipitation Last 24 Hour', y = 'Pickups')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 218 rows containing non-finite values (stat_smooth).
## Warning: Removed 1579 rows containing missing values (geom_point).

The three variables of precipitation do not seem to have an effect on rides.

Pickup and Snow Depth

ggplot(uber_cat, aes(snde, pickups)) +
  geom_jitter(alpha = 0.1) +
  geom_smooth() + labs(x = 'Snow Depth (inches)', y = 'Pickups')
## `geom_smooth()` using method = 'gam'

Summary

Through this section, I figured out that the time factor has greater effect than the natural factors like weather variables. The demand of rides relies heavily on time of the day which is able to explore about 61% of the data. Also, day of the week draws a pattern of demands during the week. Also, the number of rides had been increasing during the time range of the dataset (January to June). In the other hand, weather variables have weak relationship to the number of rides, except the temperature when demands get its peak of temperature over 75 degrees. Then it is suffices to dig more into the multi variable relationships as below.

District and Hour of Day

ggplot(uber_ndf, aes(hour, pickups)) +
  geom_jitter(alpha = 0.3, aes(colour = district)) +
  geom_smooth(aes(color = district)) +
  scale_y_log10() + labs(x = 'Hour', y = 'Pickups')
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis
## `geom_smooth()` using method = 'gam'
## Warning: Removed 5567 rows containing non-finite values (stat_smooth).

By this plot, we can conclude the strong relationship between the time of day in different districts and the number of pickups. Most the districts follow the same pattern with the exception that Staten Island and EWR have low and random demand of rides with the reason mentioned in previous section. Therefore, it is suffices to model the four major districts Bronx, Brooklyn, Manhattan and Queens.

Working and Nonworking days

uber4 <- uber_ndf %>% 
  filter(district %in% c('Manhattan', 'Brooklyn', 'Queens', 'Bronx')) %>% 
  droplevels()

ggplot(uber4, aes(hour, pickups)) +
  geom_jitter(alpha = 0.3, aes(colour = workday)) +
  geom_smooth(aes(color = workday)) +
  facet_wrap(~ district, scales = 'free', ncol = 2) + labs(x = 'Hour', y = 'Pickups')
## `geom_smooth()` using method = 'gam'

We can see that non working days slightly change the pattern but they don’t have such heavy impacts on the day’s demand.

Temperature and Rain

ggplot(uber_cat, aes(hour, Brooklyn)) +
  geom_jitter(alpha = 0.4, aes(color = temp > 75)) +
  geom_smooth(aes(color = temp > 75))
## `geom_smooth()` using method = 'gam'

ggplot(uber_cat, aes(hour, Brooklyn)) +
  geom_jitter( alpha = 0.4, aes(color = pcp01 > 0)) +
  geom_smooth(aes(color = pcp01 > 0))
## `geom_smooth()` using method = 'gam'

From these plots, we can conclude temperature and rain have no great impacts on the demand of Uber rides.

Summary

Throughout this section, I can confirm that there is a pattern of demand of rides, during the day and week, in 4 major districts Bronx, Manhattan, Queen, and Brooklyn. Also, there is none of weather variables that affect the demand of rides.

Final Summary and Reflection

Pickup per Hour by District

uber_ndf$district <- factor(uber_ndf$district, 
                            levels = c('Manhattan', 'Brooklyn', 
                                        'Queens', 'Bronx', 
                                        'Staten Island', 'EWR'))

ggplot(uber_ndf, aes(pickups)) +
  geom_histogram(aes(fill = district), bins = 50) +
  scale_x_sqrt() +
  facet_wrap(~ district, ncol = 2, scales = 'free') +
  labs(x = 'Pickups per hour', y = 'Pickups')

It is mainly normal (Brooklyn, Queens, Bronx) and bimodal (Manhattan) distribution within the 4 major districts of NYC. It is because the increase of demand during the day (time to commute to work back and forth). Also, Staten Island has a geometric distributon while EWR is nearly zero since there were very little demand of rides within these two districts.

Pickup per Week by District

ma <- ggplot(uber_cat, aes(x = wday, y = hour, fill = Manhattan)) +
  geom_tile() +
  scale_fill_distiller(palette = 'Spectral') +
  labs(title = 'Manhattan', x = 'Day', y = 'Time', fill = 'Pickups per hour') +
  theme(plot.title = element_text(hjust = 0.5))

bn <- ggplot(uber_cat, aes(x = wday, y = hour, fill = Brooklyn)) +
  geom_tile() +
  scale_fill_distiller(palette = 'Spectral') +
  labs(title = 'Brooklyn', x = 'Day', y = 'Time', fill = 'Pickups per hour') +
  theme(plot.title = element_text(hjust = 0.5))

qu <- ggplot(uber_cat, aes(x = wday, y = hour, fill = Queens)) +
  geom_tile() +
  scale_fill_distiller(palette = 'Spectral') +
  labs(title = 'Queens', x = 'Day', y = 'Time', fill = 'Pickups per hour') +
  theme(plot.title = element_text(hjust = 0.5))

bx <- ggplot(uber_cat, aes(x = wday, y = hour, fill = Bronx)) +
  geom_tile() +
  scale_fill_distiller(palette = 'Spectral') +
  labs(title = 'Bronx', x = 'Day', y = 'Time', fill = 'Pickups per hour') +
  theme(plot.title = element_text(hjust = 0.5))

st <- ggplot(uber_cat, aes(x = wday, y = hour, fill = `Staten Island`)) +
  geom_tile() +
  scale_fill_distiller(palette = 'Spectral') +
  labs(title = 'Staten Island', x = 'Day', y = 'Time', fill = 'Pickups per hour') +
  theme(plot.title = element_text(hjust = 0.5))

ew <- ggplot(uber_cat, aes(x = wday, y = hour, fill = EWR)) +
  geom_tile() +
  scale_fill_distiller(palette = 'Spectral') +
  labs(title = 'EWR', x = 'Day', y = 'Time', fill = 'Pickups per hour') +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(ma, bn, qu, bx, st, ew)

The heat maps of demand among 4 major districts follow the same pattern by day and week. It gets low demand in midnight then increases in the morning, keeps stable in the afternoon and rises rapidly in the evening. During the week, the demand of Uber rides seems slow on Monday but especially high during the weekend. Manhanttan and Brooklyn are the best items to draw the pattern. In the other hand, State Island has random demands but still quitely follow the pattern while EWR has almost zero demand.

Final Refelction

During this examination, I figured out that the weather variables have no or weak impact on the demand of Uber rides, which contradicts to the assumption. However, the time of day and week have a strong correlation to the rideshare pickups. Also, the total demand seems to increase within the time range of the observations. From these information, I am able to come up with the conclusion as well as a prediction of the increase rapidly of Uber demands in NYC, one of the busiest cities in the country. Based on this analysis, a person can have a well preparation before scheduling a pickups by time and location. Also, Uber, or epscially the drivers, can use the model to locate their service in the right time and right place to maximize the earning.